It is critical for any company to understand risk and then minimize it. A credit card company gathers a great deal of information about its applicants for exactly this purpose. The "risk" in this case is the likelihood of an applicant defaulting on credit card borrowings. Using a dataset downloaded from Kaggle, I want to test my understanding of machine learning techniques and try to build a model capable of classifying any applicant as "good" or "bad" (i.e., low or high risk).
I have a lot of historical and personal data, but no good "target" variable to predict. The first part of the project therefore investigates the credit records and develops an algorithm to generate a "HIGH RISK" target feature: a categorical variable that flags as high risk any applicant who has, at least once, been more than 60 days past due on a payment.
After dividing the applicants into the two risk categories, the next step is to merge the applicant data and the target variable into one dataframe. A proper statistical analysis will then be carried out to see whether there is a common pattern separating low-risk from high-risk credit card users. Before that, the dataframe must be split into a training set and a testing set; the test set will prove useful later.
Once the data is understood, the project can really start. First, the knowledge gained from the EDA will guide the cleaning and polishing of the dataframe: which features to drop and which to tweak (by fixing the skewness of their distributions, removing outliers, or normalizing them). A list of models will then be trained on the preprocessed data, and their performances compared to find the model best suited for our goal.
Finally, the best model will be evaluated on the unseen test data to find out how well it performs.
import pandas as pd
pd.set_option('display.max_columns', 200)
pd.options.display.float_format = '{:0,.3f}'.format
# SettingWithCopyWarning lives in pandas.errors (it was in pandas.core.common in older pandas)
from pandas.errors import SettingWithCopyWarning
import warnings
# warnings.simplefilter takes a single category, not a tuple, so filter each separately
warnings.simplefilter(action='ignore', category=SettingWithCopyWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
import operator
from pandas_profiling import ProfileReport
import numpy as np
import missingno
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="white", palette=None)
from scipy.stats import chi2_contingency
import scipy.stats as stats
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, OrdinalEncoder
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, roc_curve
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier
import scikitplot as skplt
from yellowbrick.model_selection import FeatureImportances
from imblearn.over_sampling import SMOTE
import joblib
import os
%matplotlib inline
All functions are grouped at the start in order to make the notebook more readable.
#Function to split the data into train and test sets
def data_split(df, test_size):
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=42)
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

#Function to 'enhance' pandas .describe by adding skew and kurtosis
def describe(df, stats):
    d = df.describe(include='all', percentiles=[0.25, 0.5, 0.75, 0.99])
    # DataFrame.append is deprecated/removed, so use pd.concat instead
    return pd.concat([d, df.reindex(d.columns, axis=1).agg(stats)])

#Function that returns value count and frequency for each feature
def value_count(feature):
    ftr_value_count = eda_df[feature].value_counts()
    ftr_freq = eda_df[feature].value_counts(normalize=True) * 100
    ftr_concat = pd.concat([ftr_value_count, ftr_freq], axis=1)
    ftr_concat.columns = ['Count', 'Frequency (%)']
    return ftr_concat

#Same as value_count, but for the high-risk subset
def value_count_high(feature):
    ftr_value_count = high_df[feature].value_counts()
    ftr_freq = high_df[feature].value_counts(normalize=True) * 100
    ftr_concat = pd.concat([ftr_value_count, ftr_freq], axis=1)
    ftr_concat.columns = ['Count', 'Frequency (%)']
    return ftr_concat
#Function to call the describe function for a specific feature
def gen_info(feature):
    match feature:
        case 'AGE' | 'EMPLOYMENT LENGHT' | 'ACCOUNT AGE' | 'ANNUAL INCOME':
            print('*'*55)
            print('Description:\n{}'.format(eda_df[feature].describe()))
            print('*'*55)
        case _:
            print('*'*55)
            print('Description:\n{}'.format(eda_df[feature].describe()))
            print('*'*55)
            x = value_count(feature)
            print(f'Value count:\n{x}')
            print('*'*55)

#Function to call the describe function for a specific feature, for high risk applicants
def high_info(feature):
    match feature:
        case 'AGE' | 'EMPLOYMENT LENGHT' | 'ACCOUNT AGE' | 'ANNUAL INCOME':
            print('*'*55)
            print('Description:\n{}'.format(high_df[feature].describe()))
            print('*'*55)
        case _:
            print('*'*55)
            print('Description:\n{}'.format(high_df[feature].describe()))
            print('*'*55)
            x = value_count_high(feature)
            print(f'Value count:\n{x}')
            print('*'*55)
#Function to draw a bar plot
def draw_bar_plot(feature):
    match feature:
        case 'GENDER' | 'HAS A CAR' | 'OWNS REAL ESTATE' | 'HAS A MOBILE PHONE' | 'HAS A WORK PHONE' | 'HAS A PHONE' | 'HAS AN EMAIL' | 'HIGH RISK':
            sns.set(rc={'figure.figsize': (5, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.barplot(y=value_count(feature).values[:, 0], x=value_count(feature).index)
            plt.title(f'{feature} COUNT', fontweight='bold')
            return plt.show()
        case 'OCCUPATION':
            sns.set(rc={'figure.figsize': (20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.barplot(x=value_count(feature).values[:, 0], y=value_count(feature).index)
            plt.title(f'{feature} COUNT', fontweight='bold')
            return plt.show()
        case _:
            sns.set(rc={"figure.figsize": (20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.barplot(x=value_count(feature).index, y=value_count(feature).values[:, 0])
            plt.title(f'{feature} COUNT', fontweight='bold')
            return plt.show()
#Function to draw a box plot
def draw_box_plot(feature):
    match feature:
        case 'ANNUAL INCOME':
            sns.set(rc={'figure.figsize': (5, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature])
            plt.title(f'{feature} FEATURE', fontweight='bold')
            #remove scientific notation
            plt.ticklabel_format(style='plain', axis='y')
            return plt.show()
        case _:
            sns.set(rc={'figure.figsize': (5, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature])
            plt.title(f'{feature} FEATURE', fontweight='bold')
            return plt.show()
#Function to draw a hist plot
def draw_hist_plot(feature):
    match feature:
        case 'ANNUAL INCOME':
            sns.set(rc={"figure.figsize": (20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.histplot(eda_df[feature], bins=49, kde=True)
            plt.title(f'{feature} DISTRIBUTION', fontweight='bold')
            #remove scientific notation
            plt.ticklabel_format(style='plain', axis='x')
            return plt.show()
        case _:
            sns.set(rc={"figure.figsize": (20, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.histplot(eda_df[feature], bins=49, kde=True)
            plt.title(f'{feature} DISTRIBUTION', fontweight='bold')
            return plt.show()
#Function to draw box plots comparing High Risk vs Low Risk
def high_low_box_plot(feature):
    match feature:
        case 'ANNUAL INCOME':
            sns.set(rc={'figure.figsize': (8, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature], x=eda_df['HIGH RISK'])
            #plt.ticklabel_format(style='plain', axis='y')
            plt.title(f'HIGH AND LOW RISK ON {feature}', fontweight='bold')
        case _:
            sns.set(rc={'figure.figsize': (8, 10)}, style='darkgrid')
            sns.set_theme(style="dark", palette='Set2')
            sns.boxplot(y=eda_df[feature], x=eda_df['HIGH RISK'])
            plt.title(f'HIGH AND LOW RISK ON {feature}', fontweight='bold')
    return plt.show()
#Function to statistically test the influence of each feature on the target
def chi_square_test(result_dict, feature):
    HighRisk_feature = train_og_copy[train_og_copy['HIGH RISK'] == 1][feature]
    cross = pd.crosstab(index=HighRisk_feature, columns=['Count']).rename_axis(None, axis=1)
    cross.index.name = None
    # observed values
    obs = cross
    print('*'*20 + ' ' + feature + ' ' + '*'*20)
    print('Observed values:\n')
    print(obs)
    # expected values (uniform across categories under the null hypothesis)
    exp = pd.DataFrame([obs['Count'].sum() / len(obs)] * len(obs.index), columns=['Count'], index=obs.index)
    print('*'*55)
    print('Expected values:\n')
    print(exp)
    print('\n')
    # chi-square statistic, extracted as a plain float
    chi_squared_stat = float((((obs - exp)**2) / exp).sum().iloc[0])
    print('Chi-square:\n')
    print(chi_squared_stat)
    print('\n')
    result_dict[feature] = chi_squared_stat
    # critical value (99% confidence level)
    q = 0.99
    crit = stats.chi2.ppf(q=q, df=len(obs) - 1)
    print('Critical value:\n')
    print(crit)
    print('\n')
    # p-value
    p_value = 1 - stats.chi2.cdf(x=chi_squared_stat, df=len(obs) - 1)
    print('P-value:\n')
    print(p_value)
    print('\n')
    if p_value <= (1 - q):
        print(f'WE REJECT THE NULL HYPOTHESIS: THE FEATURE "{feature}" HAS AN EFFECT ON THE TARGET')
        print('\n')
    else:
        print(f'WE FAIL TO REJECT THE NULL HYPOTHESIS: THE FEATURE "{feature}" HAS NO DETECTABLE EFFECT ON THE TARGET')
    return result_dict
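As a sanity check on the hand-rolled statistic above, the same computation can be reproduced with `scipy.stats.chisquare`. This is a sketch on hypothetical category counts, not the project's data:

```python
import numpy as np
from scipy import stats

# hypothetical observed counts of high-risk applicants across four categories
obs = np.array([120, 80, 50, 30])
# uniform expected counts under the null hypothesis, as in chi_square_test above
exp = np.full(len(obs), obs.sum() / len(obs))

# manual statistic, mirroring the (((obs - exp)**2) / exp).sum() line above
chi2_manual = (((obs - exp) ** 2) / exp).sum()
# scipy computes the same statistic and the p-value in one call
chi2_scipy, p_value = stats.chisquare(obs, f_exp=exp)
print(chi2_manual, chi2_scipy, p_value)
```

The two statistics agree, which gives some confidence that the manual formula is implemented correctly.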
#Transformer to remove outliers
class RemoveOutliers(BaseEstimator, TransformerMixin):
    def __init__(self, out_feature=['FAMILY SIZE', 'ANNUAL INCOME', 'EMPLOYMENT LENGHT']):
        self.out_feature = out_feature
    def fit(self, df):
        return self
    def transform(self, df):
        if set(self.out_feature).issubset(df.columns):
            # 25% quantile
            Q1 = df[self.out_feature].quantile(.25)
            # 75% quantile
            Q3 = df[self.out_feature].quantile(.75)
            IQR = Q3 - Q1
            # keep the data within 3 IQR of the quartiles
            df = df[~((df[self.out_feature] < (Q1 - 3 * IQR)) | (df[self.out_feature] > (Q3 + 3 * IQR))).any(axis=1)]
            return df
        else:
            print("One or more features are not in the dataframe")
            return df

#Transformer to drop useless or correlated features
class DropFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, drop_feature=['ID', '# CHILDREN', 'HAS A MOBILE PHONE', 'OCCUPATION', 'ACCOUNT AGE']):
        self.drop_feature = drop_feature
    def fit(self, df):
        return self
    def transform(self, df):
        if set(self.drop_feature).issubset(df.columns):
            df.drop(self.drop_feature, axis=1, inplace=True)
            return df
        else:
            print("One or more features are not in the dataframe")
            return df
#One-hot encoding transformer
class OneHotEncoding(BaseEstimator, TransformerMixin):
    def __init__(self, one_hot_feature=['GENDER', 'HAS A CAR', 'OWNS REAL ESTATE', 'INCOME TYPE', 'FAMILY STATUS', 'RESIDENCE TYPE', 'HAS A PHONE', 'HAS A WORK PHONE', 'HAS AN EMAIL']):
        self.one_hot_feature = one_hot_feature
    def fit(self, df):
        return self
    def transform(self, df):
        if set(self.one_hot_feature).issubset(df.columns):
            #Inner function that actually encodes the features
            def one_hot_encoding(df, one_hot_feature):
                one_hot_encoding = OneHotEncoder()
                one_hot_encoding.fit(df[one_hot_feature])
                one_hot_feature_names = one_hot_encoding.get_feature_names_out(one_hot_feature)
                df = pd.DataFrame(one_hot_encoding.transform(df[one_hot_feature]).toarray(), columns=one_hot_feature_names, index=df.index)
                return df
            #Inner function that concatenates the encoded columns with the remaining features
            def concat_with_df(df, one_hot_encoding_df, one_hot_feature):
                rest_of_features = [ft for ft in df.columns if ft not in one_hot_feature]
                df_concat = pd.concat([one_hot_encoding_df, df[rest_of_features]], axis=1)
                return df_concat
            one_hot_encoding_df = one_hot_encoding(df, self.one_hot_feature)
            full_df = concat_with_df(df, one_hot_encoding_df, self.one_hot_feature)
            return full_df
        else:
            print("One or more features are not in the dataframe")
            return df
#Min-max scaling transformer
class MinMax(BaseEstimator, TransformerMixin):
    def __init__(self, min_max_feature=['ANNUAL INCOME', 'AGE', 'EMPLOYMENT LENGHT', 'FAMILY SIZE']):
        self.min_max_feature = min_max_feature
    def fit(self, df):
        return self
    def transform(self, df):
        if set(self.min_max_feature).issubset(df.columns):
            min_max_encoding = MinMaxScaler()
            df[self.min_max_feature] = min_max_encoding.fit_transform(df[self.min_max_feature])
            return df
        else:
            print("One or more features are not in the dataframe")
            return df
#Transformer to fix skewness
class FixSkewness(BaseEstimator, TransformerMixin):
    def __init__(self, skew_feature=['ANNUAL INCOME', 'AGE', 'EMPLOYMENT LENGHT', 'FAMILY SIZE']):
        self.skew_feature = skew_feature
    def fit(self, df):
        return self
    def transform(self, df):
        if set(self.skew_feature).issubset(df.columns):
            #Since all of our features have positive skew, we use a cube-root transformation
            df[self.skew_feature] = np.cbrt(df[self.skew_feature])
            return df
        else:
            print("One or more features are not in the dataframe")
            return df

#Ordinal encoding transformer
class OrdinalEnc(BaseEstimator, TransformerMixin):
    def __init__(self, ordinal_feature=['EDUCATION']):
        self.ordinal_feature = ordinal_feature
    def fit(self, df):
        return self
    def transform(self, df):
        if set(self.ordinal_feature).issubset(df.columns):
            ord_enc = OrdinalEncoder()
            df[self.ordinal_feature] = ord_enc.fit_transform(df[self.ordinal_feature])
            return df
        else:
            print("One or more features are not in the dataframe")
            return df
#Transformer to change the target dtype to numeric
class ChangeToNumTarget(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, df):
        return self
    def transform(self, df):
        if 'HIGH RISK' in df.columns:
            df['HIGH RISK'] = pd.to_numeric(df['HIGH RISK'])
            return df
        else:
            print("HIGH RISK is not in the dataframe")
            return df

#Oversampling transformer
class Oversample(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, df):
        return self
    def transform(self, df):
        if 'HIGH RISK' in df.columns:
            # SMOTE oversamples the minority class to fix the class imbalance
            oversample = SMOTE(sampling_strategy='minority')
            X_bal, y_bal = oversample.fit_resample(df.loc[:, df.columns != 'HIGH RISK'], df['HIGH RISK'])
            df_bal = pd.concat([pd.DataFrame(X_bal), pd.DataFrame(y_bal)], axis=1)
            return df_bal
        else:
            print("HIGH RISK is not in the dataframe")
            return df
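For intuition on what oversampling does to the class balance, here is a minimal sketch using naive random oversampling (`sklearn.utils.resample`) on a toy dataframe with hypothetical values. SMOTE goes one step further: instead of duplicating rows, it synthesizes new minority points by interpolating between nearest neighbours, but the effect on the class counts is the same:

```python
import pandas as pd
from sklearn.utils import resample

# toy imbalanced dataframe: 8 low-risk rows, 2 high-risk rows (hypothetical)
df = pd.DataFrame({'x': range(10), 'HIGH RISK': [0]*8 + [1]*2})
minority = df[df['HIGH RISK'] == 1]
majority = df[df['HIGH RISK'] == 0]

# draw minority rows with replacement until both classes are the same size
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_bal = pd.concat([majority, minority_up])
print(df_bal['HIGH RISK'].value_counts())  # both classes now have 8 rows
```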
#Function to call the pipeline dedicated to cleaning the data
def DataPreprocessing(df):
    pipeline = Pipeline([
        ('RemoveOutliers', RemoveOutliers()),
        ('DropFeatures', DropFeatures()),
        ('FixSkewness', FixSkewness()),
        ('OneHot', OneHotEncoding()),
        ('Ordinal', OrdinalEnc()),
        ('MinMax', MinMax()),
        ('Numeric', ChangeToNumTarget()),
        ('Oversampling', Oversample()),
    ])
    df_pipe_prep = pipeline.fit_transform(df)
    return df_pipe_prep
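The custom transformers above all follow the same pattern: subclass `BaseEstimator` and `TransformerMixin`, return `self` from `fit`, and do the work in `transform`, so that `Pipeline` can chain them. A minimal self-contained sketch of the mechanism, with a hypothetical `AddOne` transformer in place of the project's real steps:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddOne(BaseEstimator, TransformerMixin):
    # same fit/transform pattern as the transformers above
    def fit(self, df):
        return self
    def transform(self, df):
        return df + 1

# chaining two steps: each transform's output feeds the next step
pipe = Pipeline([('add_one', AddOne()), ('add_again', AddOne())])
out = pipe.fit_transform(pd.DataFrame({'a': [1, 2]}))
print(out['a'].tolist())  # [3, 4]
```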
#Function to get the feature importances of a classifier and plot them
def feature_importance_plot(model, model_name):
    if model_name not in ['sgd', 'gaussian_naive_bayes', 'k_nearest_neighbors', 'bagging']:
        # top 10 most predictive features
        top_10_feat = FeatureImportances(model, relative=False, topn=10)
        # top 10 least predictive features
        bottom_10_feat = FeatureImportances(model, relative=False, topn=-10)
        #change the figure size
        plt.figure(figsize=(10, 4))
        #change x label font size
        plt.xlabel('xlabel', fontsize=14)
        # fit to get the feature importances
        top_10_feat.fit(X_train, y_train)
        # show the plot
        top_10_feat.show()
        print('\n')
        plt.figure(figsize=(10, 4))
        plt.xlabel('xlabel', fontsize=14)
        bottom_10_feat.fit(X_train, y_train)
        bottom_10_feat.show()
        print('\n')
    else:
        print(f'No feature importance for {model_name}')
        print('\n')
#Function to get the y prediction
def y_prediction(model, model_name, final_model=False):
    if final_model == False:
        # check if the saved predictions exist; if not, create them
        y_train_pred_path = Path(f'saved_models/{model_name}/y_train_copy_pred_{model_name}.sav')
        try:
            y_train_pred_path.resolve(strict=True)
        except FileNotFoundError:
            # cross-validated predictions with kfold = 10
            y_cc_train_pred = cross_val_predict(model, X_train, y_train, cv=10, n_jobs=-1)
            # save the predictions
            joblib.dump(y_cc_train_pred, y_train_pred_path)
            return y_cc_train_pred
        else:
            # if the file exists, load the predictions
            y_cc_train_pred = joblib.load(y_train_pred_path)
            return y_cc_train_pred
    else:
        # same logic, but for the final model
        y_train_pred_path_final = Path(f'saved_models_final/{model_name}/y_train_copy_pred_{model_name}_final.sav')
        try:
            y_train_pred_path_final.resolve(strict=True)
        except FileNotFoundError:
            y_cc_train_pred_final = cross_val_predict(model, X_train, y_train, cv=10, n_jobs=-1)
            joblib.dump(y_cc_train_pred_final, y_train_pred_path_final)
            return y_cc_train_pred_final
        else:
            y_cc_train_pred_final = joblib.load(y_train_pred_path_final)
            return y_cc_train_pred_final
#Function to plot the confusion matrix
def confusion_matrix(model, model_name, final_model=False):
    if final_model == False:
        fig, ax = plt.subplots(figsize=(8, 8))
        #plot the confusion matrix
        conf_matrix = ConfusionMatrixDisplay.from_predictions(y_train, y_prediction(model, model_name), ax=ax, cmap='GnBu', values_format='d')
        # remove the grid
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.grid(visible=None, axis='both')
        # increase the font size of the x and y labels
        plt.xlabel('Predicted label', fontsize=14)
        plt.ylabel('True label', fontsize=14)
        plt.title('Confusion Matrix', fontsize=14)
        plt.show()
        print('\n')
    else:
        fig, ax = plt.subplots(figsize=(8, 8))
        #plot the confusion matrix for the final model
        conf_matrix_final = ConfusionMatrixDisplay.from_predictions(y_train, y_prediction(model, model_name, final_model=True), ax=ax, cmap='GnBu', values_format='d')
        ax.grid(False)
        ax.set_xticks([])
        ax.set_yticks([])
        plt.grid(visible=None)
        plt.xlabel('Predicted label', fontsize=14)
        plt.ylabel('True label', fontsize=14)
        plt.title('Confusion Matrix', fontsize=14)
        plt.show()
        print('\n')
#Function to plot the ROC curve (note: this shadows sklearn's roc_curve imported above)
def roc_curve(model, model_name, final_model=False):
    if final_model == False:
        # check if the y probabilities file exists; if not, create it
        y_proba_path = Path(f'saved_models/{model_name}/y_cc_train_proba_{model_name}.sav')
        try:
            y_proba_path.resolve(strict=True)
        except FileNotFoundError:
            y_train_proba = model.predict_proba(X_train)
            joblib.dump(y_train_proba, y_proba_path)
        else:
            # if the file exists, load the y probabilities
            y_train_proba = joblib.load(y_proba_path)
        skplt.metrics.plot_roc(y_train, y_train_proba, title=f'ROC curve for {model_name}', cmap='cool', figsize=(8, 6), text_fontsize='large')
        #remove the grid
        plt.grid(visible=None)
        plt.show()
        print('\n')
    else:
        # same logic, but for the final model
        y_proba_path_final = Path(f'saved_models_final/{model_name}/y_cc_train_proba_{model_name}_final.sav')
        try:
            y_proba_path_final.resolve(strict=True)
        except FileNotFoundError:
            y_train_proba_final = model.predict_proba(X_train)
            joblib.dump(y_train_proba_final, y_proba_path_final)
        else:
            y_train_proba_final = joblib.load(y_proba_path_final)
        skplt.metrics.plot_roc(y_train, y_train_proba_final, title=f'ROC curve for {model_name}', cmap='cool', figsize=(8, 6), text_fontsize='large')
        plt.grid(visible=None)
        plt.show()
        print('\n')
#Function to display the classification report
def score(model, model_name, final_model=False):
    if final_model == False:
        class_report = classification_report(y_train, y_prediction(model, model_name))
        print(class_report)
    else:
        class_report_final = classification_report(y_train, y_prediction(model, model_name, final_model=True))
        print(class_report_final)
#Function to train the model
def train_model(model, model_name, final_model=False):
    # if we are not training the final model
    if final_model == False:
        # check if the model file exists; if not, create, train and save it
        model_file_path = Path(f'saved_models/{model_name}/{model_name}_model.sav')
        try:
            model_file_path.resolve(strict=True)
        except FileNotFoundError:
            if model_name == 'sgd':
                # for sgd, loss='hinge' has no predict_proba method, so we wrap it in a calibrated model
                calibrated_model = CalibratedClassifierCV(model, cv=10, method='sigmoid')
                model_trn = calibrated_model.fit(X_train, y_train)
            else:
                model_trn = model.fit(X_train, y_train)
            # save the fitted model (not the unfitted estimator)
            joblib.dump(model_trn, model_file_path)
            return model_trn
        else:
            # if the file exists, load the model
            model = joblib.load(model_file_path)
            return model
    else:
        # check if the final model file exists; if not, create, train and save it
        final_model_file_path = Path(f'saved_models_final/{model_name}/{model_name}_model.sav')
        try:
            final_model_file_path.resolve(strict=True)
        except FileNotFoundError:
            model = model.fit(X_train, y_train)
            joblib.dump(model, final_model_file_path)
            return model
        else:
            # if the file exists, load the model
            model = joblib.load(final_model_file_path)
            return model
#Function to check if the folder for saving the model exists, and create it if not
def folder_check():
    # relies on a global model_name being set before the call
    if not os.path.exists(f'saved_models/{model_name}'):
        os.makedirs(f'saved_models/{model_name}')
#Read Csv Files
app_df = pd.read_csv('application_record.csv')
record_df = pd.read_csv('credit_record.csv')
#Flag as risky every account that was, at least once, past due by 60 days or more (STATUS 2-5)
record_df['High Risk'] = np.nan
record_df.loc[record_df['STATUS'].isin(['2', '3', '4', '5']), 'High Risk'] = 'Yes'
#Group data by account ID; count() tallies non-null values per column
record_df = record_df.groupby('ID').count()
#Number of months with past due payments
record_df['PAST DUE MONTHS'] = record_df['High Risk']
record_df = record_df.rename(columns={'MONTHS_BALANCE':'ACCOUNT AGE'})
#Flag each account that was at least once past due (use .loc to avoid chained assignment)
record_df['X'] = None
record_df.loc[record_df['High Risk'] == 0, 'X'] = 0
record_df.loc[record_df['High Risk'] > 0, 'X'] = 1
record_df.drop(['High Risk', 'STATUS'], axis=1, inplace=True)
record_df = record_df.rename(columns={'X':'HIGH RISK'})
#We are only interested in the account age and the categorical target 'HIGH RISK'
record_df = record_df[['ACCOUNT AGE', 'HIGH RISK']]
#Merging data with applicants df, from now on we are working with this
app_df = pd.merge(app_df, record_df, how='inner', on='ID')
#Better column names
app_df = app_df.rename(columns={
    'CODE_GENDER': 'GENDER', 'FLAG_OWN_CAR': 'HAS A CAR', 'CNT_CHILDREN': '# CHILDREN',
    'AMT_INCOME_TOTAL': 'ANNUAL INCOME', 'NAME_INCOME_TYPE': 'INCOME TYPE', 'NAME_EDUCATION_TYPE': 'EDUCATION',
    'NAME_FAMILY_STATUS': 'FAMILY STATUS', 'NAME_HOUSING_TYPE': 'RESIDENCE TYPE', 'DAYS_BIRTH': 'AGE',
    'DAYS_EMPLOYED': 'EMPLOYMENT LENGHT', 'FLAG_MOBIL': 'HAS A MOBILE PHONE', 'FLAG_WORK_PHONE': 'HAS A WORK PHONE',
    'FLAG_PHONE': 'HAS A PHONE', 'FLAG_EMAIL': 'HAS AN EMAIL', 'OCCUPATION_TYPE': 'OCCUPATION',
    'CNT_FAM_MEMBERS': 'FAMILY SIZE', 'FLAG_OWN_REALTY': 'OWNS REAL ESTATE'})
#Fix applicant's age, employment age and categorical features
app_df['AGE'] = np.trunc(np.abs(app_df['AGE']/365)).astype(np.int64)
#Pensioner Fix. Instead of 1000, let's assume that maximum amount of working years is 50
app_df['EMPLOYMENT LENGHT'] = np.trunc(np.abs(app_df['EMPLOYMENT LENGHT']/365)).astype(np.int64)
app_df = app_df.replace({'EMPLOYMENT LENGHT':{1000:50}})
app_df = app_df.replace({'HAS A MOBILE PHONE':{0:'N',1:'Y'}})
app_df = app_df.replace({'HAS A WORK PHONE':{0:'N',1:'Y'}})
app_df = app_df.replace({'HAS A PHONE':{0:'N',1:'Y'}})
app_df = app_df.replace({'HAS AN EMAIL':{0:'N',1:'Y'}})
app_df['FAMILY SIZE'] = app_df['FAMILY SIZE'].astype(np.int64)
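The DAYS_* conversion above (negative day counts turned into truncated positive years) can be checked on a couple of hypothetical raw values:

```python
import numpy as np

# hypothetical raw DAYS_BIRTH values: days counted backwards from the application date
days_birth = np.array([-15000, -22000])
# same conversion as used for 'AGE' above: absolute value, divide by 365, truncate
age_years = np.trunc(np.abs(days_birth / 365)).astype(np.int64)
print(age_years.tolist())  # [41, 60]
```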
train_og, test_og = data_split(app_df, 0.3)
#saving train and test sets
train_og.to_csv('dataset/train.csv', index=False)
test_og.to_csv('dataset/test.csv', index=False)
#creating a backup
train_og_copy = train_og.copy()
test_og_copy = test_og.copy()
#We are going to look only at the train data
eda_df = train_og
eda_df = eda_df.replace({'HIGH RISK':{0:'N', 1:'Y'}})
#Subset of high-risk applicants (the values were just recoded to 'Y'/'N')
high_df = eda_df[eda_df['HIGH RISK'] == 'Y']
#Bird's eye view on our dataframe
describe(eda_df, ['skew', 'kurt'])
| | ID | GENDER | HAS A CAR | OWNS REAL ESTATE | # CHILDREN | ANNUAL INCOME | INCOME TYPE | EDUCATION | FAMILY STATUS | RESIDENCE TYPE | AGE | EMPLOYMENT LENGHT | HAS A MOBILE PHONE | HAS A WORK PHONE | HAS A PHONE | HAS AN EMAIL | OCCUPATION | FAMILY SIZE | ACCOUNT AGE | HIGH RISK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 25,519.000 | 25519 | 25519 | 25519 | 25,519.000 | 25,519.000 | 25519 | 25519 | 25519 | 25519 | 25,519.000 | 25,519.000 | 25519 | 25519 | 25519 | 25519 | 17589 | 25,519.000 | 25,519.000 | 25519 |
| unique | NaN | 2 | 2 | 2 | NaN | NaN | 5 | 5 | 5 | 6 | NaN | NaN | 1 | 2 | 2 | 2 | 18 | NaN | NaN | 2 |
| top | NaN | F | N | Y | NaN | NaN | Working | Secondary / secondary special | Married | House / apartment | NaN | NaN | Y | N | N | N | Laborers | NaN | NaN | N |
| freq | NaN | 17117 | 15876 | 17159 | NaN | NaN | 13145 | 17266 | 17575 | 22820 | NaN | NaN | 25519 | 19801 | 17980 | 23207 | 4402 | NaN | NaN | 25082 |
| mean | 5,078,278.001 | NaN | NaN | NaN | 0.433 | 187,022.766 | NaN | NaN | NaN | NaN | 43.290 | 14.103 | NaN | NaN | NaN | NaN | NaN | 2.201 | 21.275 | NaN |
| std | 41,788.362 | NaN | NaN | NaN | 0.747 | 101,869.023 | NaN | NaN | NaN | NaN | 11.512 | 17.257 | NaN | NaN | NaN | NaN | NaN | 0.915 | 14.870 | NaN |
| min | 5,008,805.000 | NaN | NaN | NaN | 0.000 | 27,000.000 | NaN | NaN | NaN | NaN | 21.000 | 0.000 | NaN | NaN | NaN | NaN | NaN | 1.000 | 1.000 | NaN |
| 25% | 5,042,112.500 | NaN | NaN | NaN | 0.000 | 121,500.000 | NaN | NaN | NaN | NaN | 34.000 | 3.000 | NaN | NaN | NaN | NaN | NaN | 2.000 | 9.000 | NaN |
| 50% | 5,074,692.000 | NaN | NaN | NaN | 0.000 | 157,500.000 | NaN | NaN | NaN | NaN | 42.000 | 6.000 | NaN | NaN | NaN | NaN | NaN | 2.000 | 18.000 | NaN |
| 75% | 5,114,615.500 | NaN | NaN | NaN | 1.000 | 225,000.000 | NaN | NaN | NaN | NaN | 53.000 | 15.000 | NaN | NaN | NaN | NaN | NaN | 3.000 | 31.000 | NaN |
| 99% | 5,149,808.820 | NaN | NaN | NaN | 3.000 | 560,250.000 | NaN | NaN | NaN | NaN | 66.000 | 50.000 | NaN | NaN | NaN | NaN | NaN | 5.000 | 59.000 | NaN |
| max | 5,150,482.000 | NaN | NaN | NaN | 19.000 | 1,575,000.000 | NaN | NaN | NaN | NaN | 68.000 | 50.000 | NaN | NaN | NaN | NaN | NaN | 20.000 | 61.000 | NaN |
| skew | 0.082 | NaN | NaN | NaN | 2.703 | 2.748 | NaN | NaN | NaN | NaN | 0.183 | 1.388 | NaN | NaN | NaN | NaN | NaN | 1.373 | 0.731 | NaN |
| kurt | -1.207 | NaN | NaN | NaN | 26.214 | 17.870 | NaN | NaN | NaN | NaN | -1.044 | 0.307 | NaN | NaN | NaN | NaN | NaN | 9.649 | -0.382 | NaN |
"OCCUPATION" column is the only one that counts fewer rows compared to the rest of the dataset, which implies that this feature comes with null values, so we need to check that.
We can already obtain some basic information about our applicants, but first we need to study the distribution of each numerical variable.
The skewness is positive, meaning that every distribution has a tail on the right. 'AGE' and 'ACCOUNT AGE' are the ones with the skewness closer to zero, which indicates that those are the ones closest to normality.
Kurtosis is positive and has a high value for the number of children, income, and family size. Those features are probably affected by outliers.
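This positive skew is what the FixSkewness transformer above corrects with a cube-root transform. A quick illustration on synthetic right-skewed data (hypothetical log-normal "income" values, not the Kaggle dataset):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# log-normal draws give a long right tail, like the income feature
income = rng.lognormal(mean=11, sigma=0.5, size=5000)

# the cube root compresses the right tail, pulling the skewness toward zero
print(round(skew(income), 2), round(skew(np.cbrt(income)), 2))
```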
#We shouldn't have any NaNs except in the 'OCCUPATION' column. Let's check.
missingno.matrix(eda_df, color=(1, 0.38, 0.27));
This confirms our suspicion: 'OCCUPATION' is the only feature with NaNs, which probably correspond to people without a job. Since it is not a numerical feature, fixing this could be problematic.
eda_df[eda_df['OCCUPATION'].isnull()].head(10)
| | ID | GENDER | HAS A CAR | OWNS REAL ESTATE | # CHILDREN | ANNUAL INCOME | INCOME TYPE | EDUCATION | FAMILY STATUS | RESIDENCE TYPE | AGE | EMPLOYMENT LENGHT | HAS A MOBILE PHONE | HAS A WORK PHONE | HAS A PHONE | HAS AN EMAIL | OCCUPATION | FAMILY SIZE | ACCOUNT AGE | HIGH RISK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 5068319 | F | N | Y | 0 | 189,000.000 | Pensioner | Higher education | Separated | House / apartment | 62 | 50 | Y | N | N | N | NaN | 1 | 3 | N |
| 14 | 5067680 | F | N | Y | 0 | 90,000.000 | Pensioner | Secondary / secondary special | Married | House / apartment | 59 | 50 | Y | N | N | N | NaN | 2 | 3 | N |
| 15 | 5111102 | M | Y | Y | 0 | 351,000.000 | Pensioner | Secondary / secondary special | Married | House / apartment | 61 | 50 | Y | N | N | N | NaN | 2 | 60 | N |
| 16 | 5054237 | F | N | N | 0 | 135,000.000 | Pensioner | Higher education | Single / not married | House / apartment | 57 | 50 | Y | N | Y | N | NaN | 1 | 15 | N |
| 20 | 5022267 | M | Y | Y | 0 | 131,400.000 | Pensioner | Secondary / secondary special | Married | House / apartment | 64 | 50 | Y | N | N | N | NaN | 2 | 9 | N |
| 24 | 5096839 | F | N | Y | 0 | 94,500.000 | Pensioner | Higher education | Single / not married | House / apartment | 61 | 50 | Y | N | N | N | NaN | 1 | 43 | N |
| 26 | 5037243 | F | N | Y | 0 | 121,500.000 | Pensioner | Secondary / secondary special | Married | House / apartment | 65 | 50 | Y | N | N | N | NaN | 2 | 14 | N |
| 29 | 5145887 | F | N | Y | 0 | 72,000.000 | Pensioner | Secondary / secondary special | Married | House / apartment | 60 | 50 | Y | N | N | N | NaN | 2 | 4 | N |
| 33 | 5021720 | F | Y | Y | 1 | 630,000.000 | Working | Secondary / secondary special | Married | House / apartment | 48 | 1 | Y | N | N | N | NaN | 3 | 13 | N |
| 35 | 5105412 | F | N | Y | 2 | 135,000.000 | Working | Secondary / secondary special | Married | House / apartment | 34 | 4 | Y | N | N | N | NaN | 4 | 26 | N |
x = eda_df['INCOME TYPE'].unique()
print(f'Income types on our df are {x}')
Income types on our df are ['Working' 'Commercial associate' 'State servant' 'Pensioner' 'Student']
There are no jobless applicants (which makes sense given that the minimum income value isn't 0).
The IDs without an occupation belong to people of various ages, income levels, and education levels.
It could be a simple data-entry gap, and since we already have plenty of other information about each applicant, we could simply delete this column.
We'll make that decision later in the analysis.
#EDA report in html
profile_report = ProfileReport(eda_df, explorative=True, dark_mode=False)
profile_report_file_path = Path('pandas_profile_file/credit_pred_profile.html')
try:
    profile_report_file_path.resolve(strict=True)
except FileNotFoundError:
    profile_report.to_file("pandas_profile_file/credit_pred_profile.html")
Using the handy pandas-profiling library, an interactive report can be created and saved as an HTML page for later consultation.
feature = 'GENDER'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           F
freq      17117
Name: GENDER, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
F  17117         67.076
M   8402         32.924
*******************************************************
feature = 'HAS A CAR'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      15876
Name: HAS A CAR, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  15876         62.212
Y   9643         37.788
*******************************************************
feature = 'OWNS REAL ESTATE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           Y
freq      17159
Name: OWNS REAL ESTATE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
Y  17159         67.240
N   8360         32.760
*******************************************************
feature = '# CHILDREN'
gen_info(feature)
draw_box_plot(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count 25,519.000
mean 0.433
std 0.747
min 0.000
25% 0.000
50% 0.000
75% 1.000
max 19.000
Name: # CHILDREN, dtype: float64
*******************************************************
Value count:
Count Frequency (%)
0 17610 69.007
1 5249 20.569
2 2311 9.056
3 282 1.105
4 47 0.184
5 15 0.059
7 2 0.008
14 2 0.008
19 1 0.004
*******************************************************
feature='ANNUAL INCOME'
gen_info(feature)
draw_box_plot(feature)
draw_hist_plot(feature)
high_low_box_plot(feature)
*******************************************************
Description:
count       25,519.000
mean       187,022.766
std        101,869.023
min         27,000.000
25%        121,500.000
50%        157,500.000
75%        225,000.000
max      1,575,000.000
Name: ANNUAL INCOME, dtype: float64
*******************************************************
feature='EDUCATION'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count 25519
unique 5
top Secondary / secondary special
freq 17266
Name: EDUCATION, dtype: object
*******************************************************
Value count:
Count Frequency (%)
Secondary / secondary special 17266 67.659
Higher education 6972 27.321
Incomplete higher 995 3.899
Lower secondary 264 1.035
Academic degree 22 0.086
*******************************************************
feature = 'FAMILY STATUS'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count 25519
unique 5
top Married
freq 17575
Name: FAMILY STATUS, dtype: object
*******************************************************
Value count:
Count Frequency (%)
Married 17575 68.870
Single / not married 3362 13.174
Civil marriage 2024 7.931
Separated 1487 5.827
Widow 1071 4.197
*******************************************************
feature = 'RESIDENCE TYPE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count 25519
unique 6
top House / apartment
freq 22820
Name: RESIDENCE TYPE, dtype: object
*******************************************************
Value count:
Count Frequency (%)
House / apartment 22820 89.424
With parents 1222 4.789
Municipal apartment 772 3.025
Rented apartment 407 1.595
Office apartment 184 0.721
Co-op apartment 114 0.447
*******************************************************
feature = 'AGE'
gen_info(feature)
draw_box_plot(feature)
high_low_box_plot(feature)
draw_bar_plot(feature)
draw_hist_plot(feature)
*******************************************************
Description:
count   25,519.000
mean        43.290
std         11.512
min         21.000
25%         34.000
50%         42.000
75%         53.000
max         68.000
Name: AGE, dtype: float64
*******************************************************
feature = 'EMPLOYMENT LENGHT'
gen_info(feature)
draw_box_plot(feature)
high_low_box_plot(feature)
draw_bar_plot(feature)
draw_hist_plot(feature)
*******************************************************
Description:
count   25,519.000
mean        14.103
std         17.257
min          0.000
25%          3.000
50%          6.000
75%         15.000
max         50.000
Name: EMPLOYMENT LENGHT, dtype: float64
*******************************************************
feature = 'HAS A MOBILE PHONE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        1
top           Y
freq      25519
Name: HAS A MOBILE PHONE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
Y  25519        100.000
*******************************************************
feature = 'HAS A WORK PHONE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      19801
Name: HAS A WORK PHONE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  19801         77.593
Y   5718         22.407
*******************************************************
feature = 'HAS A PHONE'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      17980
Name: HAS A PHONE, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  17980         70.457
Y   7539         29.543
*******************************************************
feature = 'HAS AN EMAIL'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      23207
Name: HAS AN EMAIL, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  23207         90.940
Y   2312          9.060
*******************************************************
feature = 'OCCUPATION'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count 17589
unique 18
top Laborers
freq 4402
Name: OCCUPATION, dtype: object
*******************************************************
Value count:
Count Frequency (%)
Laborers 4402 25.027
Core staff 2484 14.122
Sales staff 2407 13.685
Managers 2120 12.053
Drivers 1499 8.522
High skill tech staff 991 5.634
Accountants 862 4.901
Medicine staff 831 4.725
Cooking staff 463 2.632
Security staff 408 2.320
Cleaning staff 374 2.126
Private service staff 249 1.416
Low-skill Laborers 120 0.682
Waiters/barmen staff 116 0.660
Secretaries 108 0.614
HR staff 59 0.335
Realty agents 51 0.290
IT staff 45 0.256
*******************************************************
feature = 'FAMILY SIZE'
gen_info(feature)
draw_box_plot(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count 25,519.000
mean 2.201
std 0.915
min 1.000
25% 2.000
50% 2.000
75% 3.000
max 20.000
Name: FAMILY SIZE, dtype: float64
*******************************************************
Value count:
Count Frequency (%)
2 13623 53.384
1 4875 19.103
3 4486 17.579
4 2203 8.633
5 270 1.058
6 43 0.169
7 14 0.055
9 2 0.008
15 2 0.008
20 1 0.004
*******************************************************
feature='ACCOUNT AGE'
gen_info(feature)
draw_box_plot(feature)
high_low_box_plot(feature)
draw_bar_plot(feature)
draw_hist_plot(feature)
*******************************************************
Description:
count   25,519.000
mean        21.275
std         14.870
min          1.000
25%          9.000
50%         18.000
75%         31.000
max         61.000
Name: ACCOUNT AGE, dtype: float64
*******************************************************
feature='HIGH RISK'
gen_info(feature)
draw_bar_plot(feature)
*******************************************************
Description:
count     25519
unique        2
top           N
freq      25082
Name: HIGH RISK, dtype: object
*******************************************************
Value count:
   Count  Frequency (%)
N  25082         98.288
Y    437          1.712
*******************************************************
#Encode the binary Y/N columns as 1/0
eda_df = eda_df.replace({
    'HAS A MOBILE PHONE': {'N': 0, 'Y': 1},
    'HAS A WORK PHONE':   {'N': 0, 'Y': 1},
    'HAS A PHONE':        {'N': 0, 'Y': 1},
    'HAS AN EMAIL':       {'N': 0, 'Y': 1},
    'HIGH RISK':          {'N': 0, 'Y': 1},
})
sns.pairplot(eda_df.drop(['ID', 'HAS A MOBILE PHONE', 'HAS A WORK PHONE', 'HAS A PHONE', 'HAS AN EMAIL'], axis=1), hue='HIGH RISK', corner=True);
'# CHILDREN' and 'FAMILY SIZE' are strongly correlated, which makes sense: the more children you have, the bigger your family is. The same goes for 'AGE' and 'EMPLOYMENT LENGHT': the older you are, the longer your career has been. Having pairs of features that are correlated with each other could be a problem later on.
#Account Age and Applicant age
sns.jointplot(x = eda_df['ACCOUNT AGE'], y = eda_df['AGE'], kind="hex", height=12)
plt.show()
Most of the users are between 25 and 50 years old and have an account that is no older than 25 months.
#Correlation
plt.figure(figsize=(25,10))
plt.title('Correlation Matrix',fontsize=25)
corr = eda_df.corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr, annot=True, cmap='flare',mask=mask, linewidths=.5)
plt.show()
Children and family size, and age and employment length, are correlated with each other, as we already knew. No feature is meaningfully correlated with HIGH RISK.
#How age affects the other variables
fig, axes = plt.subplots(4,2,figsize=(30,25),dpi=250)
fig.tight_layout(pad=9.0)
sns.boxplot(ax=axes[0,0], x=eda_df['GENDER'], y=eda_df['AGE']);
sns.boxplot(ax=axes[0,1], x=eda_df['OWNS REAL ESTATE'], y=eda_df['AGE']);
sns.boxplot(ax=axes[1,0], x=eda_df['HAS A CAR'], y=eda_df['AGE']);
sns.boxplot(ax=axes[1,1], x=eda_df['RESIDENCE TYPE'], y=eda_df['AGE']);
sns.boxplot(ax=axes[2,0], x=eda_df['AGE'], y=eda_df['FAMILY STATUS']);
sns.boxplot(ax=axes[2,1], x=eda_df['AGE'], y=eda_df['INCOME TYPE']);
sns.boxplot(ax=axes[3,0], x=eda_df['AGE'], y=eda_df['EDUCATION']);
sns.boxplot(ax=axes[3,1], x=eda_df['AGE'], y=eda_df['OCCUPATION']);
#How income affects the other variables
fig, axes = plt.subplots(4,2,figsize=(30,25),dpi=250)
fig.tight_layout(pad=9.0)
sns.boxplot(ax=axes[0,0], x=eda_df['GENDER'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[0,1], x=eda_df['OWNS REAL ESTATE'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[1,0], x=eda_df['HAS A CAR'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[1,1], x=eda_df['RESIDENCE TYPE'], y=eda_df['ANNUAL INCOME']);
sns.boxplot(ax=axes[2,0], x=eda_df['ANNUAL INCOME'], y=eda_df['FAMILY STATUS']);
sns.boxplot(ax=axes[2,1], x=eda_df['ANNUAL INCOME'], y=eda_df['INCOME TYPE']);
sns.boxplot(ax=axes[3,0], x=eda_df['ANNUAL INCOME'], y=eda_df['EDUCATION']);
sns.boxplot(ax=axes[3,1], x=eda_df['ANNUAL INCOME'], y=eda_df['OCCUPATION']);
The correlation analysis we did previously told us that our target variable is not highly correlated with any feature in our dataset, but we still need to know whether any of them has some effect on 'HIGH RISK' in order to build a correct model later on. In other words, we want to test whether the occurrence of a specific feature value and the occurrence of a specific class are independent.
The Chi-square test serves this purpose, as it is used in statistics to test the independence of two events. Chi-square measures how much the expected counts 'E' and the observed counts 'O' deviate from each other. When two features are independent, the observed counts are close to the expected counts, so we get a smaller Chi-square value. A high Chi-square value therefore indicates that the hypothesis of independence is likely incorrect: the higher the Chi-square value, the more dependent the feature is on the target, and the better a candidate it is for model training.
In the hypothesis test, the null hypothesis is: 'the feature has no effect on the target variable'. If the p-value is higher than alpha (at a 99% confidence level, alpha = 0.01), we fail to reject the null hypothesis; otherwise we reject it.
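The `chi_square_test` helper used below is not shown; the outputs suggest a goodness-of-fit test of each feature's distribution among high-risk applicants against uniform expected counts. A minimal sketch of such a helper with scipy (the signature and the toy data are illustrative, not the notebook's actual code):

```python
import numpy as np
import pandas as pd
from scipy import stats

def chi_square_test(result_dict, feature, df, alpha=0.01):
    """Goodness-of-fit chi-square: does the feature's distribution among
    high-risk applicants deviate from uniform expected counts?"""
    observed = df.loc[df['HIGH RISK'] == 1, feature].value_counts().sort_index()
    expected = np.full(len(observed), observed.sum() / len(observed))
    chi2_stat, p_value = stats.chisquare(observed, expected)
    # critical value at the chosen confidence level, dof = #categories - 1
    critical = stats.chi2.ppf(1 - alpha, df=len(observed) - 1)
    result_dict[feature] = chi2_stat
    if chi2_stat > critical:
        print(f'WE REJECT THE NULL HYPOTHESIS: THE FEATURE "{feature}" HAS EFFECT ON TARGET')
    else:
        print(f'We fail to reject the null hypothesis for "{feature}"')
    return chi2_stat, p_value

# Toy data reproducing the GENDER counts among high-risk applicants
toy = pd.DataFrame({'GENDER': ['F'] * 271 + ['M'] * 166, 'HIGH RISK': 1})
results = {}
chi2_stat, p = chi_square_test(results, 'GENDER', toy)
```

With these counts the statistic comes out to roughly 25.229, matching the GENDER output below.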
feature_test = ['GENDER','HAS A CAR','OWNS REAL ESTATE','INCOME TYPE','EDUCATION','FAMILY STATUS','RESIDENCE TYPE','OCCUPATION']
chi_sq_dict = {}
for ft in feature_test:
    chi_square_test(chi_sq_dict, ft)
******************** GENDER ********************
Observed values:
Count
F 271
M 166
*******************************************************
Expected values:
Count
F 218.500
M 218.500
Chi-square:
25.22883295194508
Critical value:
6.6348966010212145
P-value:
[5.09152992e-07]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "GENDER" HAS EFFECT ON TARGET
******************** HAS A CAR ********************
Observed values:
Count
N 276
Y 161
*******************************************************
Expected values:
Count
N 218.500
Y 218.500
Chi-square:
30.263157894736842
Critical value:
6.6348966010212145
P-value:
[3.77223487e-08]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "HAS A CAR" HAS EFFECT ON TARGET
******************** OWNS REAL ESTATE ********************
Observed values:
Count
N 182
Y 255
*******************************************************
Expected values:
Count
N 218.500
Y 218.500
Chi-square:
12.194508009153319
Critical value:
6.6348966010212145
P-value:
[0.0004793]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "OWNS REAL ESTATE" HAS EFFECT ON TARGET
******************** INCOME TYPE ********************
Observed values:
Count
Commercial associate 97
Pensioner 90
State servant 23
Working 227
*******************************************************
Expected values:
Count
Commercial associate 109.250
Pensioner 109.250
State servant 109.250
Working 109.250
Chi-square:
199.76887871853546
Critical value:
11.344866730144373
P-value:
[0.]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "INCOME TYPE" HAS EFFECT ON TARGET
******************** EDUCATION ********************
Observed values:
Count
Higher education 111
Incomplete higher 22
Lower secondary 8
Secondary / secondary special 296
*******************************************************
Expected values:
Count
Higher education 109.250
Incomplete higher 109.250
Lower secondary 109.250
Secondary / secondary special 109.250
Chi-square:
482.7711670480549
Critical value:
11.344866730144373
P-value:
[0.]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "EDUCATION" HAS EFFECT ON TARGET
******************** FAMILY STATUS ********************
Observed values:
Count
Civil marriage 33
Married 280
Separated 18
Single / not married 77
Widow 29
*******************************************************
Expected values:
Count
Civil marriage 87.400
Married 87.400
Separated 87.400
Single / not married 87.400
Widow 87.400
Chi-square:
553.6521739130434
Critical value:
13.276704135987622
P-value:
[0.]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "FAMILY STATUS" HAS EFFECT ON TARGET
******************** RESIDENCE TYPE ********************
Observed values:
Count
Co-op apartment 2
House / apartment 385
Municipal apartment 20
Office apartment 4
Rented apartment 6
With parents 20
*******************************************************
Expected values:
Count
Co-op apartment 72.833
House / apartment 72.833
Municipal apartment 72.833
Office apartment 72.833
Rented apartment 72.833
With parents 72.833
Chi-square:
1609.8787185354695
Critical value:
15.08627246938899
P-value:
[0.]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "RESIDENCE TYPE" HAS EFFECT ON TARGET
******************** OCCUPATION ********************
Observed values:
Count
Accountants 16
Cleaning staff 4
Cooking staff 6
Core staff 47
Drivers 35
HR staff 1
High skill tech staff 21
IT staff 3
Laborers 74
Low-skill Laborers 5
Managers 32
Medicine staff 9
Private service staff 1
Sales staff 36
Secretaries 2
Security staff 10
Waiters/barmen staff 1
*******************************************************
Expected values:
Count
Accountants 17.824
Cleaning staff 17.824
Cooking staff 17.824
Core staff 17.824
Drivers 17.824
HR staff 17.824
High skill tech staff 17.824
IT staff 17.824
Laborers 17.824
Low-skill Laborers 17.824
Managers 17.824
Medicine staff 17.824
Private service staff 17.824
Sales staff 17.824
Secretaries 17.824
Security staff 17.824
Waiters/barmen staff 17.824
Chi-square:
381.5445544554455
Critical value:
31.999926908815176
P-value:
[0.]
WE REJECT THE NULL HYPOTHESIS: THE FEATURE "OCCUPATION" HAS EFFECT ON TARGET
#Sorted Chi-square value
sortdict=sorted(chi_sq_dict.items(),key=operator.itemgetter(1),reverse=True)
print(sortdict)
[('RESIDENCE TYPE', 1609.8787185354695), ('FAMILY STATUS', 553.6521739130434), ('EDUCATION', 482.7711670480549), ('OCCUPATION', 381.5445544554455), ('INCOME TYPE', 199.76887871853546), ('HAS A CAR', 30.263157894736842), ('GENDER', 25.22883295194508), ('OWNS REAL ESTATE', 12.194508009153319)]
The average applicant is a woman (67%) who doesn't have a car (62%) but owns real estate (67%). She lives in a house/apartment (89%), is married (68%), and has at most one child (90%).
She earns almost 187,000 a year. She has a secondary degree (68%), started working 14 years ago, and is 43 years old. Furthermore, she has a mobile phone (100%) but no landline phone (70%), work phone (77%), or email (90%). She is a laborer (25%) and has been a client for about 21 months. She is not considered a high-risk applicant (only 2% are).
98% of all applicants are low-risk, and that could be a problem later on: with so few high-risk applicants, it is difficult to characterize the average high-risk user, and a classifier trained on such imbalanced data will tend to favor the majority class. There does seem to be a pattern related to age: younger people tend to have less job experience, which leads to lower income, so younger people are more likely to struggle with debts.
So, to understand a little about who could be more likely to be a high-risk user, let's see how the variables are affected by age:
Men tend to be younger than women. Younger people are less likely to own real estate but more likely to own a car. They are also more likely to live with their parents, be single, and have an incomplete education.
The same process has been applied to income:
It emerges that men have, on average, a higher income. Wealthier applicants own real estate and live in houses/apartments, and most of them are managers or realty agents. People with lower incomes are students, pensioners, and younger people with an incomplete or lower education: in short, people with fewer working skills and less job experience.
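The balanced supports in the classification reports later on (24,746 per class) suggest the training data was resampled; one simple remedy for the 98/2 imbalance is to oversample the minority class. A toy sketch in plain pandas (the project's actual resampling step is not shown):

```python
import pandas as pd

# Toy imbalanced frame: 98 low-risk rows, 2 high-risk rows
df = pd.DataFrame({'x': range(100), 'HIGH RISK': [0] * 98 + [1] * 2})

majority = df[df['HIGH RISK'] == 0]
minority = df[df['HIGH RISK'] == 1]

# Resample the minority class with replacement up to the majority size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced['HIGH RISK'].value_counts())
```

Oversampling should only ever be applied to the training split, never to the test set, or the evaluation leaks duplicated rows.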
Features that need to be dropped:
Features that need one-hot encoding:
Feature that needs ordinal encoding:
Features that need normalization:
Features with skewed data, that need to be reduced:
Features with outliers:
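The concrete column lists above were left to be filled in, but the shape of such a `DataPreprocessing` step can be sketched. Everything below is illustrative: the dropped columns, the ordinal order for EDUCATION, and the choice of log1p for skew are assumptions, not the project's actual choices.

```python
import numpy as np
import pandas as pd

# Hypothetical ordinal encoding for EDUCATION (lowest to highest)
EDUCATION_ORDER = {
    'Lower secondary': 0, 'Secondary / secondary special': 1,
    'Incomplete higher': 2, 'Higher education': 3, 'Academic degree': 4,
}

def data_preprocessing(df):
    """Sketch of the preprocessing step; column choices are illustrative."""
    df = df.copy()
    # drop identifier / constant columns
    df = df.drop(columns=['ID', 'HAS A MOBILE PHONE'], errors='ignore')
    # ordinal encoding for the ranked feature
    df['EDUCATION'] = df['EDUCATION'].map(EDUCATION_ORDER)
    # reduce the right skew of income
    df['ANNUAL INCOME'] = np.log1p(df['ANNUAL INCOME'])
    # one-hot encode an unordered categorical
    df = pd.get_dummies(df, columns=['FAMILY STATUS'])
    # min-max normalization of the numeric columns
    for col in ['AGE', 'ANNUAL INCOME']:
        df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
    return df

toy = pd.DataFrame({
    'ID': [1, 2], 'HAS A MOBILE PHONE': [1, 1],
    'EDUCATION': ['Higher education', 'Lower secondary'],
    'ANNUAL INCOME': [157500.0, 27000.0],
    'FAMILY STATUS': ['Married', 'Widow'],
    'AGE': [43, 21],
})
out = data_preprocessing(toy)
```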
train=DataPreprocessing(train_og_copy)
#We separate train data X from the target y
y_train = train['HIGH RISK']
X_train = train.drop(['HIGH RISK'], axis=1)
#List of promising models
models = {
'sgd':SGDClassifier(random_state=42,loss='perceptron'),
'logistic_regression':LogisticRegression(random_state=42,max_iter=1000),
'decision_tree':DecisionTreeClassifier(random_state=42),
'random_forest':RandomForestClassifier(random_state=42),
'gaussian_naive_bayes':GaussianNB(),
'k_nearest_neighbors':KNeighborsClassifier(),
'gradient_boosting':GradientBoostingClassifier(random_state=42),
'linear_discriminant_analysis':LinearDiscriminantAnalysis(),
'bagging':BaggingClassifier(random_state=42),
'adaboost':AdaBoostClassifier(random_state=42),
'extra_trees':ExtraTreesClassifier(random_state=42),
'xgboost':XGBClassifier(random_state=42)
}
warnings.filterwarnings("ignore")
# loop over all the models
for model_name, model in models.items():
    # title formatting
    print('\n\n')
    print(' {} '.format(model_name).center(50, '-'))
    print('\n')
    # check that the folder for saving the models exists; if not, create it
    folder_check()
    # train the model
    model_trn = train_model(model, model_name)
    # print the scores from the classification report
    score(model_trn, model_name)
    # plot the ROC curve
    roc_curve(model_trn, model_name)
    # plot the confusion matrix
    confusion_matrix(model_trn, model_name)
    # plot feature importance
    feature_importance_plot(model_trn, model_name)
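The helpers used in the loop (`folder_check`, `train_model`, etc.) are defined elsewhere in the notebook. A plausible sketch of the two that touch the disk, assuming joblib persistence and a hypothetical folder name:

```python
from pathlib import Path

import joblib
from sklearn.tree import DecisionTreeClassifier

MODEL_DIR = Path('saved_models')  # assumed location; the real folder name is not shown

def folder_check():
    """Create the model folder if it doesn't exist yet."""
    MODEL_DIR.mkdir(exist_ok=True)

def train_model(model, model_name, X=None, y=None):
    """Fit the model, or load it from disk if a previous run already saved it.
    (Sketch of the helper used in the loop; its real signature is not shown.)"""
    model_path = MODEL_DIR / f'{model_name}.joblib'
    if model_path.exists():
        return joblib.load(model_path)
    model.fit(X, y)
    joblib.dump(model, model_path)
    return model

# Tiny demonstration on toy data
folder_check()
clf = train_model(DecisionTreeClassifier(random_state=42), 'toy_tree',
                  X=[[0], [1], [2], [3]], y=[0, 0, 1, 1])
print(clf.predict([[0], [3]]))
```

Caching the fitted models this way means re-running the notebook skips the expensive fits.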
---------------------- sgd ----------------------
precision recall f1-score support
0 0.56 0.57 0.57 24746
1 0.56 0.56 0.56 24746
accuracy 0.56 49492
macro avg 0.56 0.56 0.56 49492
weighted avg 0.56 0.56 0.56 49492
No feature importance for sgd
---------------------- logistic_regression ----------------------
precision recall f1-score support
0 0.57 0.56 0.56 24746
1 0.57 0.57 0.57 24746
accuracy 0.57 49492
macro avg 0.57 0.57 0.57 49492
weighted avg 0.57 0.57 0.57 49492
---------------------- decision_tree ----------------------
precision recall f1-score support
0 0.98 0.98 0.98 24746
1 0.98 0.98 0.98 24746
accuracy 0.98 49492
macro avg 0.98 0.98 0.98 49492
weighted avg 0.98 0.98 0.98 49492
---------------------- random_forest ----------------------
precision recall f1-score support
0 0.99 0.99 0.99 24746
1 0.99 0.99 0.99 24746
accuracy 0.99 49492
macro avg 0.99 0.99 0.99 49492
weighted avg 0.99 0.99 0.99 49492
---------------------- gaussian_naive_bayes ----------------------
precision recall f1-score support
0 0.71 0.07 0.13 24746
1 0.51 0.97 0.67 24746
accuracy 0.52 49492
macro avg 0.61 0.52 0.40 49492
weighted avg 0.61 0.52 0.40 49492
No feature importance for gaussian_naive_bayes
---------------------- k_nearest_neighbors ----------------------
precision recall f1-score support
0 0.98 0.95 0.97 24746
1 0.95 0.98 0.97 24746
accuracy 0.97 49492
macro avg 0.97 0.97 0.97 49492
weighted avg 0.97 0.97 0.97 49492
No feature importance for k_nearest_neighbors
---------------------- gradient_boosting ----------------------
precision recall f1-score support
0 0.86 0.94 0.90 24746
1 0.93 0.85 0.89 24746
accuracy 0.89 49492
macro avg 0.90 0.89 0.89 49492
weighted avg 0.90 0.89 0.89 49492
---------------------- linear_discriminant_analysis ----------------------
precision recall f1-score support
0 0.57 0.56 0.56 24746
1 0.57 0.57 0.57 24746
accuracy 0.57 49492
macro avg 0.57 0.57 0.57 49492
weighted avg 0.57 0.57 0.57 49492
---------------------- bagging ----------------------
precision recall f1-score support
0 0.99 0.99 0.99 24746
1 0.99 0.99 0.99 24746
accuracy 0.99 49492
macro avg 0.99 0.99 0.99 49492
weighted avg 0.99 0.99 0.99 49492
No feature importance for bagging
---------------------- adaboost ----------------------
precision recall f1-score support
0 0.75 0.78 0.77 24746
1 0.77 0.74 0.76 24746
accuracy 0.76 49492
macro avg 0.76 0.76 0.76 49492
weighted avg 0.76 0.76 0.76 49492
---------------------- extra_trees ----------------------
precision recall f1-score support
0 0.99 0.99 0.99 24746
1 0.99 0.99 0.99 24746
accuracy 0.99 49492
macro avg 0.99 0.99 0.99 49492
weighted avg 0.99 0.99 0.99 49492
---------------------- xgboost ----------------------
precision recall f1-score support
0 0.99 0.99 0.99 24746
1 0.99 0.99 0.99 24746
accuracy 0.99 49492
macro avg 0.99 0.99 0.99 49492
weighted avg 0.99 0.99 0.99 49492
Now a decision needs to be made. We have a handful of promising models, but we need the one that best fits our needs. For a credit card company, a good model should minimize the risk of marking as low risk an applicant who is actually high risk; in other words, the model should produce as few false negatives as possible, which on the high-risk class means high recall (and, ideally, high precision too). XGBoost matches the description, being among the models with the best scores on both.
In some situations, however, companies could make different decisions. For example, imagine a scenario in which the economy is doing great: salaries are growing, and so is spending, so money is flowing. In a situation like that, a credit card company could be more interested in maximizing the number of approved legitimate users than in minimizing the risk of a couple of false negatives, since a larger user base makes it easier for the company to absorb a handful of risky accounts. In that scenario another model, with a different balance of recall and precision, could do a better job.
Good precision and good recall usually trade off against each other, so when choosing between a couple of good models we must figure out which matters more for the specific problem. For this project, I imagined the second scenario, so I chose the gradient boosting model.
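The trade-off can be made concrete by reading precision and recall off a confusion matrix. A small worked example with hypothetical counts:

```python
import numpy as np

# Hypothetical confusion matrix [[TN, FP], [FN, TP]] for a high-risk classifier
cm = np.array([[900, 100],
               [ 20,  80]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the applicants flagged high risk, how many really are
recall    = tp / (tp + fn)  # of the truly high-risk applicants, how many were caught

print(f'precision = {precision:.3f}, recall = {recall:.3f}')
```

Lowering the decision threshold catches more of the high-risk applicants (fewer false negatives, higher recall) at the cost of flagging more good applicants (more false positives, lower precision), and vice versa.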
test = DataPreprocessing(test_og_copy)
X_test = test.drop(['HIGH RISK'],axis=1)
y_test = test['HIGH RISK']
model = train_model(models['gradient_boosting'],'gradient_boosting')
predictions = model.predict(X_test)
n_correct = sum(predictions == y_test)
print(f'The model was able to correctly classify the data {(round(n_correct/len(predictions),4))*100}% of the time')
The model was able to correctly classify the data 86.2% of the time
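Plain accuracy can flatter a classifier when the test set is imbalanced, so it is worth also checking per-class metrics. A toy sketch (the numbers are illustrative, not the project's test results):

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

# Toy imbalanced test set: plain accuracy looks fine, minority recall does not
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

print(classification_report(y_true, y_pred, zero_division=0))
bal_acc = balanced_accuracy_score(y_true, y_pred)
print(f'balanced accuracy: {bal_acc:.3f}')
```

Here accuracy is 96% while the minority class recall is only 20%, which balanced accuracy (60%) makes visible at a glance.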